Report on the TREC-10 Experiment: Distributed Collections and Entrypage Searching
Abstract
For our participation in TREC-10, we focus on searching distributed collections and on designing and implementing a new search strategy for finding homepages. The first part of this paper presents a new merging strategy based on the lengths of the retrieved lists. The second part develops our approach to building retrieval models that combine Web page and URL address information when searching for online service locations.
Similar resources
Report on the TREC-8 Experiment: Searching on the Web and in Distributed Collections
The Internet paradigm permits information searches to be made across wide-area networks where information is contained in web pages and/or whole document collections such as digital libraries. These new distributed information environments reveal new and challenging problems for the IR community. Consequently, in this TREC experiment we investigated two questions related to information searches...
Report on the TREC-9 Experiment: Link-based Retrieval and Distributed Collections
The web and its search engines have resulted in a new paradigm, generating new challenges for the IR community which are in turn attracting a growing interest from around the world. The decision by NIST to build a new and larger test collection based on web pages represents a very attractive initiative. This motivated us at TREC-9 to support and participate in the creation of this new corpus, t...
Applying Inference Networks to Multiple Collection Searching
The paper describes how to use inference networks to solve two problems in searching multiple collections: collection selection and result merging. The effectiveness of the approaches is demonstrated with the INQUERY system and 3 gigabyte TREC collections.
Distributed Multisearch and Resource Selection for the TREC Million Query Track
A distributed information retrieval system with resource‐selection and result‐set merging capability was used to search subsets of the GOV2 document corpus for the 2008 TREC Million Query Track. The GOV2 collection was partitioned into host‐name subcollections and distributed to multiple remote machines. The Multisearch demonstration application restricted each search to a fraction of the avail...
Lucene for n-grams using the CLUEWeb Collection
The ARSC team made modifications to the Apache Lucene engine to accommodate "go words," taken from the Google Gigaword vocabulary of n‐grams. Indexing the Category "B" subset of the ClueWeb collection was accomplished by a divide and conquer method, working across the separate ClueWeb subsets for 1, 2 and 3‐grams. Phrase searching, or imposing an order on query terms, has traditionally been a...